Part of Udacity Data Analyst Nanodegree

Introduction

For the EDA project, the data set is the Prosper Loan Data. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The variable definitions for the data can be found here.

#suppress warnings (the previous setting is saved in oldw so it can be restored later)
oldw <- getOption("warn")
options(warn = -1)

#load all the libraries needed. 
library(rmarkdown)
library(ggplot2)
library(ggthemes)
library(gridExtra)
library(scales)
library(dplyr)
library(GGally)
library(tidyr)
library(maps)
library(openintro)
library(ggpubr)
library(lubridate)

Data

Let’s start this exploration by loading the data into a data frame so it can be inspected.

#load dataset
df <- read.csv('prosperLoanData.csv', na.strings=c("NA", "NULL",""))
#check dataset
head(df)
##                ListingKey ListingNumber           ListingCreationDate
## 1 1021339766868145413AB3B        193129 2007-08-26 19:09:29.263000000
## 2 10273602499503308B223C1       1209647 2014-02-27 08:28:07.900000000
## 3 0EE9337825851032864889A         81716 2007-01-05 15:00:47.090000000
## 4 0EF5356002482715299901A        658116 2012-10-22 11:02:35.010000000
## 5 0F023589499656230C5E3E2        909464 2013-09-14 18:38:39.097000000
## 6 0F05359734824199381F61D       1074836 2013-12-14 08:26:37.093000000
##   CreditGrade Term LoanStatus          ClosedDate BorrowerAPR BorrowerRate
## 1           C   36  Completed 2009-08-14 00:00:00     0.16516       0.1580
## 2        <NA>   36    Current                <NA>     0.12016       0.0920
## 3          HR   36  Completed 2009-12-17 00:00:00     0.28269       0.2750
## 4        <NA>   36    Current                <NA>     0.12528       0.0974
## 5        <NA>   36    Current                <NA>     0.24614       0.2085
## 6        <NA>   60    Current                <NA>     0.15425       0.1314
##   LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## 1      0.1380                      NA            NA              NA
## 2      0.0820                 0.07960        0.0249         0.05470
## 3      0.2400                      NA            NA              NA
## 4      0.0874                 0.08490        0.0249         0.06000
## 5      0.1985                 0.18316        0.0925         0.09066
## 6      0.1214                 0.11567        0.0449         0.07077
##   ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## 1                      NA                  <NA>           NA
## 2                       6                     A            7
## 3                      NA                  <NA>           NA
## 4                       6                     A            9
## 5                       3                     D            4
## 6                       5                     B           10
##   ListingCategory..numeric. BorrowerState    Occupation EmploymentStatus
## 1                         0            CO         Other    Self-employed
## 2                         2            CO  Professional         Employed
## 3                         0            GA         Other    Not available
## 4                        16            GA Skilled Labor         Employed
## 5                         2            MN     Executive         Employed
## 6                         1            NM  Professional         Employed
##   EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## 1                        2                True             True
## 2                       44               False            False
## 3                       NA               False             True
## 4                      113                True            False
## 5                       44                True            False
## 6                       82                True            False
##                  GroupKey              DateCreditPulled
## 1                    <NA> 2007-08-26 18:41:46.780000000
## 2                    <NA>           2014-02-27 08:28:14
## 3 783C3371218786870A73D20 2007-01-02 14:09:10.060000000
## 4                    <NA>           2012-10-22 11:02:32
## 5                    <NA>           2013-09-14 18:38:44
## 6                    <NA>           2013-12-14 08:26:40
##   CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## 1                   640                   659     2001-10-11 00:00:00
## 2                   680                   699     1996-03-18 00:00:00
## 3                   480                   499     2002-07-27 00:00:00
## 4                   800                   819     1983-02-28 00:00:00
## 5                   680                   699     2004-02-20 00:00:00
## 6                   740                   759     1973-03-01 00:00:00
##   CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## 1                  5               4                         12
## 2                 14              14                         29
## 3                 NA              NA                          3
## 4                  5               5                         29
## 5                 19              19                         49
## 6                 21              17                         49
##   OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## 1                     1                          24                    3
## 2                    13                         389                    3
## 3                     0                           0                    0
## 4                     7                         115                    0
## 5                     6                         220                    1
## 6                    13                        1410                    0
##   TotalInquiries CurrentDelinquencies AmountDelinquent
## 1              3                    2              472
## 2              5                    0                0
## 3              1                    1               NA
## 4              1                    4            10056
## 5              9                    0                0
## 6              2                    0                0
##   DelinquenciesLast7Years PublicRecordsLast10Years
## 1                       4                        0
## 2                       0                        1
## 3                       0                        0
## 4                      14                        0
## 5                       0                        0
## 6                       0                        0
##   PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## 1                         0                      0                0.00
## 2                         0                   3989                0.21
## 3                        NA                     NA                  NA
## 4                         0                   1444                0.04
## 5                         0                   6193                0.81
## 6                         0                  62999                0.39
##   AvailableBankcardCredit TotalTrades TradesNeverDelinquent..percentage.
## 1                    1500          11                               0.81
## 2                   10266          29                               1.00
## 3                      NA          NA                                 NA
## 4                   30754          26                               0.76
## 5                     695          39                               0.95
## 6                   86509          47                               1.00
##   TradesOpenedLast6Months DebtToIncomeRatio    IncomeRange
## 1                       0              0.17 $25,000-49,999
## 2                       2              0.18 $50,000-74,999
## 3                      NA              0.06  Not displayed
## 4                       0              0.15 $25,000-49,999
## 5                       2              0.26      $100,000+
## 6                       0              0.36      $100,000+
##   IncomeVerifiable StatedMonthlyIncome                 LoanKey
## 1             True            3083.333 E33A3400205839220442E84
## 2             True            6125.000 9E3B37071505919926B1D82
## 3             True            2083.333 6954337960046817851BCB2
## 4             True            2875.000 A0393664465886295619C51
## 5             True            9583.333 A180369302188889200689E
## 6             True            8333.333 C3D63702273952547E79520
##   TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments
## 1                NA                         NA                    NA
## 2                NA                         NA                    NA
## 3                NA                         NA                    NA
## 4                NA                         NA                    NA
## 5                 1                         11                    11
## 6                NA                         NA                    NA
##   ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 1                                  NA                              NA
## 2                                  NA                              NA
## 3                                  NA                              NA
## 4                                  NA                              NA
## 5                                   0                               0
## 6                                  NA                              NA
##   ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 1                       NA                          NA
## 2                       NA                          NA
## 3                       NA                          NA
## 4                       NA                          NA
## 5                    11000                      9947.9
## 6                       NA                          NA
##   ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 1                          NA                         0
## 2                          NA                         0
## 3                          NA                         0
## 4                          NA                         0
## 5                          NA                         0
## 6                          NA                         0
##   LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## 1                            NA                         78      19141
## 2                            NA                          0     134815
## 3                            NA                         86       6466
## 4                            NA                         16      77296
## 5                            NA                          6     102670
## 6                            NA                          3     123257
##   LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## 1               9425 2007-09-12 00:00:00                Q3 2007
## 2              10000 2014-03-03 00:00:00                Q1 2014
## 3               3001 2007-01-17 00:00:00                Q1 2007
## 4              10000 2012-11-01 00:00:00                Q4 2012
## 5              15000 2013-09-20 00:00:00                Q3 2013
## 6              15000 2013-12-24 00:00:00                Q4 2013
##                 MemberKey MonthlyLoanPayment LP_CustomerPayments
## 1 1F3E3376408759268057EDA             330.43            11396.14
## 2 1D13370546739025387B2F4             318.93                0.00
## 3 5F7033715035555618FA612             123.32             4186.63
## 4 9ADE356069835475068C6D2             321.45             5143.20
## 5 36CE356043264555721F06C             563.97             2819.85
## 6 874A3701157341738DE458F             342.37              679.34
##   LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## 1                      9425.00            1971.14        -133.18
## 2                         0.00               0.00           0.00
## 3                      3001.00            1185.63         -24.20
## 4                      4091.09            1052.11        -108.01
## 5                      1563.22            1256.63         -60.27
## 6                       351.89             327.45         -25.33
##   LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 1                 0                     0                   0
## 2                 0                     0                   0
## 3                 0                     0                   0
## 4                 0                     0                   0
## 5                 0                     0                   0
## 6                 0                     0                   0
##   LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## 1                               0             1               0
## 2                               0             1               0
## 3                               0             1               0
## 4                               0             1               0
## 5                               0             1               0
## 6                               0             1               0
##   InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## 1                          0                           0       258
## 2                          0                           0         1
## 3                          0                           0        41
## 4                          0                           0       158
## 5                          0                           0        20
## 6                          0                           0         1
#check length of dataset
nrow(df)
## [1] 113937

With 81 columns and 113,937 rows, the data set is large enough that we can pick a subset of it to analyse. I will start by dividing it in two: loans made before mid-2009 and loans made after. The data itself is structured this way, since columns such as CreditGrade apply only to loans made before July 2009. The data therefore tells a story of the pre- and post-2009 scenarios, and that is what I will explore in this analysis.

Some background on my decision to divide the data is important. From 2007 to mid-2009, the Great Financial Crisis occurred, followed by the Great Recession. The precipitating factor for this crisis was a high default rate in the United States subprime home mortgage sector, i.e. home loans. Critics argued that credit rating agencies and investors failed to accurately price the risk involved with mortgage-related financial products, and that governments did not adjust their regulatory practices to address 21st-century financial markets.

Thus, I will divide the data into a pre-crisis and a post-crisis data set to get an accurate depiction of the differences, especially in the credit grade and in the numerous columns, such as the Prosper score, that exist only after 2009.

#start by removing some of the unnecessary columns
df <- df[ , !names(df) %in% c('ListingKey','LoanKey','ListingCreationDate','DateCreditPulled','MemberKey')]

# split the rows at July 31, 2009: loans originated on or before that date go to pre_df
df$Date <- as.Date(df$LoanOriginationDate)
pre_df <- subset(df, Date <= as.Date("2009-07-31"))
post_df <- subset(df, Date > as.Date("2009-07-31"))
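The split above can be checked on a small example. The following is a minimal sketch on made-up rows (the `toy` data frame and its four timestamps are assumptions for illustration, not rows from the Prosper file); it shows that `as.Date()` keeps only the date part of such timestamps and that every row lands in exactly one of the two subsets.

```r
# Toy data: timestamps in the same "YYYY-MM-DD HH:MM:SS" form as LoanOriginationDate
toy <- data.frame(
  LoanOriginationDate = c("2007-09-12 00:00:00", "2009-07-31 00:00:00",
                          "2009-08-01 00:00:00", "2014-03-03 00:00:00"),
  stringsAsFactors = FALSE
)

# as.Date() keeps the date part and drops the time of day
toy$Date <- as.Date(toy$LoanOriginationDate)

cutoff   <- as.Date("2009-07-31")
toy_pre  <- subset(toy, Date <= cutoff)   # the cutoff day itself counts as "pre"
toy_post <- subset(toy, Date >  cutoff)

# Every row should land in exactly one of the two subsets
stopifnot(nrow(toy_pre) + nrow(toy_post) == nrow(toy))
```

The same sanity check applies to the real data: the two row counts reported in this analysis, 28973 and 84964, add up to the 113937 rows of the full data set.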

Let’s check the two dataframes

PRE 2009 DF

#number of rows
print(nrow(pre_df))
## [1] 28973
#number of missing values (NA) in each column
na_count <- sapply(pre_df, function(y) sum(is.na(y)))
na_count
##                       ListingNumber                         CreditGrade 
##                                   0                                  20 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                                   0                                  25 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                               28957                               28957 
##                     EstimatedReturn             ProsperRating..numeric. 
##                               28957                               28957 
##               ProsperRating..Alpha.                        ProsperScore 
##                               28957                               28957 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                5515 
##                          Occupation                    EmploymentStatus 
##                                2255                                2255 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                7606                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                               17678 
##               CreditScoreRangeLower               CreditScoreRangeUpper 
##                                 591                                 591 
##             FirstRecordedCreditLine                  CurrentCreditLines 
##                                 697                                7604 
##                     OpenCreditLines          TotalCreditLinespast7years 
##                                7604                                 697 
##               OpenRevolvingAccounts         OpenRevolvingMonthlyPayment 
##                                   0                                   0 
##                InquiriesLast6Months                      TotalInquiries 
##                                 697                                1159 
##                CurrentDelinquencies                    AmountDelinquent 
##                                 697                                7622 
##             DelinquenciesLast7Years            PublicRecordsLast10Years 
##                                 990                                 697 
##           PublicRecordsLast12Months              RevolvingCreditBalance 
##                                7604                                7604 
##                 BankcardUtilization             AvailableBankcardCredit 
##                                7604                                7544 
##                         TotalTrades  TradesNeverDelinquent..percentage. 
##                                7544                                7544 
##             TradesOpenedLast6Months                   DebtToIncomeRatio 
##                                7544                                1248 
##                         IncomeRange                    IncomeVerifiable 
##                                   0                                   0 
##                 StatedMonthlyIncome                   TotalProsperLoans 
##                                   0                               26730 
##          TotalProsperPaymentsBilled               OnTimeProsperPayments 
##                               26730                               26730 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                               26730                               26730 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                               26730                               26730 
##         ScorexChangeAtTimeOfListing           LoanCurrentDaysDelinquent 
##                               26731                                   0 
##       LoanFirstDefaultedCycleNumber          LoanMonthsSinceOrigination 
##                               18272                                   0 
##                          LoanNumber                  LoanOriginalAmount 
##                                   0                                   0 
##                 LoanOriginationDate              LoanOriginationQuarter 
##                                   0                                   0 
##                  MonthlyLoanPayment                 LP_CustomerPayments 
##                                   0                                   0 
##        LP_CustomerPrincipalPayments                  LP_InterestandFees 
##                                   0                                   0 
##                      LP_ServiceFees                   LP_CollectionFees 
##                                   0                                   0 
##               LP_GrossPrincipalLoss                 LP_NetPrincipalLoss 
##                                   0                                   0 
##     LP_NonPrincipalRecoverypayments                       PercentFunded 
##                                   0                                   0 
##                     Recommendations          InvestmentFromFriendsCount 
##                                   0                                   0 
##         InvestmentFromFriendsAmount                           Investors 
##                                   0                                   0 
##                                Date 
##                                   0

POST 2009 DF

print(nrow(post_df))
## [1] 84964
na_count <- sapply(post_df, function(y) sum(is.na(y)))
na_count
##                       ListingNumber                         CreditGrade 
##                                   0                               84964 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                               58848                                   0 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                                 127                                 127 
##                     EstimatedReturn             ProsperRating..numeric. 
##                                 127                                 127 
##               ProsperRating..Alpha.                        ProsperScore 
##                                 127                                 127 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                   0 
##                          Occupation                    EmploymentStatus 
##                                1333                                   0 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                  19                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                               82918 
##               CreditScoreRangeLower               CreditScoreRangeUpper 
##                                   0                                   0 
##             FirstRecordedCreditLine                  CurrentCreditLines 
##                                   0                                   0 
##                     OpenCreditLines          TotalCreditLinespast7years 
##                                   0                                   0 
##               OpenRevolvingAccounts         OpenRevolvingMonthlyPayment 
##                                   0                                   0 
##                InquiriesLast6Months                      TotalInquiries 
##                                   0                                   0 
##                CurrentDelinquencies                    AmountDelinquent 
##                                   0                                   0 
##             DelinquenciesLast7Years            PublicRecordsLast10Years 
##                                   0                                   0 
##           PublicRecordsLast12Months              RevolvingCreditBalance 
##                                   0                                   0 
##                 BankcardUtilization             AvailableBankcardCredit 
##                                   0                                   0 
##                         TotalTrades  TradesNeverDelinquent..percentage. 
##                                   0                                   0 
##             TradesOpenedLast6Months                   DebtToIncomeRatio 
##                                   0                                7306 
##                         IncomeRange                    IncomeVerifiable 
##                                   0                                   0 
##                 StatedMonthlyIncome                   TotalProsperLoans 
##                                   0                               65122 
##          TotalProsperPaymentsBilled               OnTimeProsperPayments 
##                               65122                               65122 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                               65122                               65122 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                               65122                               65122 
##         ScorexChangeAtTimeOfListing           LoanCurrentDaysDelinquent 
##                               68278                                   0 
##       LoanFirstDefaultedCycleNumber          LoanMonthsSinceOrigination 
##                               78713                                   0 
##                          LoanNumber                  LoanOriginalAmount 
##                                   0                                   0 
##                 LoanOriginationDate              LoanOriginationQuarter 
##                                   0                                   0 
##                  MonthlyLoanPayment                 LP_CustomerPayments 
##                                   0                                   0 
##        LP_CustomerPrincipalPayments                  LP_InterestandFees 
##                                   0                                   0 
##                      LP_ServiceFees                   LP_CollectionFees 
##                                   0                                   0 
##               LP_GrossPrincipalLoss                 LP_NetPrincipalLoss 
##                                   0                                   0 
##     LP_NonPrincipalRecoverypayments                       PercentFunded 
##                                   0                                   0 
##                     Recommendations          InvestmentFromFriendsCount 
##                                   0                                   0 
##         InvestmentFromFriendsAmount                           Investors 
##                                   0                                   0 
##                                Date 
##                                   0

So there is an obvious difference between the pre-2009 and post-2009 loan data. For starters, the number of rows in each set is quite different (28,973 versus 84,964), so I will have to use ratios and percentages rather than raw counts in my analysis.
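Because the two sets differ so much in size, missing-value counts are easier to compare as shares. Here is a minimal sketch, assuming a hypothetical helper `na_share()` and two tiny made-up data frames standing in for `pre_df` and `post_df`:

```r
# Hypothetical helper: fraction of missing values in each column
na_share <- function(d) colMeans(is.na(d))

# Toy stand-ins for pre_df and post_df (values are made up for illustration)
toy_pre  <- data.frame(CreditGrade  = c("C", "HR", "A", NA),
                       ProsperScore = c(NA, NA, NA, NA))
toy_post <- data.frame(CreditGrade  = c(NA, NA),
                       ProsperScore = c(7, 9))

# Side-by-side comparison that does not depend on the number of rows
comparison <- data.frame(
  pre  = na_share(toy_pre),
  post = na_share(toy_post)
)
print(round(comparison, 2))
```

Applied to the real data frames, the same call would express each `na_count` entry as a fraction of its own set, which makes the pre and post columns directly comparable.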

Analysis

Category of Loan

I want to see whether the reasons for which loans were taken out differ before and after 2009. Since this is a discrete variable, a bar chart is a good choice for this type of analysis.

table(pre_df$ListingCategory..numeric.)
## 
##     0     1     2     3     4     5     6     7 
## 16945  5071   623  1874  2395   476   329  1260
table(post_df$ListingCategory..numeric.)
## 
##     0     1     2     3     5     6     7     8     9    10    11    12 
##    20 53237  6810  5315   280  2243  9234   199    85    91   217    59 
##    13    14    15    16    17    18    19    20 
##  1996   876  1522   304    52   885   768   771
ggplot(as.data.frame(table(post_df$ListingCategory..numeric.)), aes(x= factor(Var1) , y = Freq/sum(Freq))) +
 geom_bar(stat="identity") +
 ylab("Percentage") +
 xlab(NULL) +
 scale_y_continuous(labels=scales::percent) +
 scale_x_discrete(breaks=0:20, labels = stringr::str_wrap(c("Not Available", "Debt Consolidation", "Home Improvement", "Business", "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat", "Cosmetic Procedure", "Engagement Ring", "Green Loans","Household Expenses","Large Purchases","Medical/Dental","Motorcycle","RV","Taxes","Vacation","Wedding Loans"),   width =10)) +
 theme(axis.text.x = element_text(angle = 90)) +
 ggtitle('Post Crisis Categories of Loans')

ggplot(as.data.frame(table(pre_df$ListingCategory..numeric.)),
              aes(x=factor(Var1), y = Freq/sum(Freq))) +
  geom_bar(stat="identity", width = 0.75) +
  ylab("Percentage") +
  xlab(NULL) +
  scale_y_continuous(labels=scales::percent) +
  scale_x_discrete(breaks=0:7, labels = stringr::str_wrap(c("Not Available", "Debt Consolidation", "Home Improvement", "Business", "Personal Loan", "Student Use", "Auto", "Other") , width =10)) +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle('Pre Crisis Categories of Loans')
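The plots above pair numeric category codes with labels by position. As a defensive alternative, the labels can be looked up by name, so that a category missing from one data set (code 4 does not appear in the post-crisis table) cannot shift the remaining labels. The sketch below uses made-up codes and only the first eight labels; `category_labels` is a name introduced here for illustration.

```r
# Code-to-label lookup for the first eight listing categories,
# copied from the plot labels used above
category_labels <- c(
  "0" = "Not Available", "1" = "Debt Consolidation", "2" = "Home Improvement",
  "3" = "Business", "4" = "Personal Loan", "5" = "Student Use",
  "6" = "Auto", "7" = "Other"
)

# Toy codes with category 4 absent, as in the post-crisis table
codes <- c(0, 1, 1, 2, 3, 5, 6, 7, 7, 7)

counts <- table(codes)
shares <- as.vector(counts) / sum(counts)

# Look labels up by code, not by position
names(shares) <- unname(category_labels[names(counts)])
print(round(shares, 2))
```

A named vector of this kind can also be passed as `labels` to `scale_x_discrete()`, which matches labels to factor levels by name.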

Interestingly, individuals take out loans for a much wider variety of reasons after the crisis. This seems to be in line with the article here, which highlights that, with the recession over and people back in their jobs, there seems to be trust in the industry again. So we see quite a few differences between the pre-2009 and the post-2009 loan categories.

  • Post crisis, loans are taken out for many different reasons, including luxury items such as a boat or an RV. Maybe prices have gone up, whereas before the crisis people could afford such items without borrowing.
  • Pre crisis, the loan category is not available for almost 60% of the loans, but a trend is still visible: there is no luxury spending. Loans were taken out only when there was an actual need, not for luxuries.
  • Debt consolidation is the most common stated reason for taking out a loan both pre crisis and post crisis. Why do people take out so many loans for debt consolidation?
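The percentages behind these observations can be recomputed from the counts printed by `table()` earlier. A small sketch, with the counts typed in by hand from that output (the variable names are introduced here for illustration):

```r
# Counts copied from the table() output for pre_df
pre_counts <- c("0" = 16945, "1" = 5071, "2" = 623, "3" = 1874,
                "4" = 2395, "5" = 476, "6" = 329, "7" = 1260)

post_total <- 84964   # nrow(post_df)
post_debt  <- 53237   # post-crisis count for category 1 (Debt Consolidation)

pre_total <- sum(pre_counts)

shares <- c(
  pre_not_available  = pre_counts[["0"]] / pre_total,
  pre_debt_of_all    = pre_counts[["1"]] / pre_total,
  # same count, but relative to loans whose category is known
  pre_debt_of_known  = pre_counts[["1"]] / (pre_total - pre_counts[["0"]]),
  post_debt_of_all   = post_debt / post_total
)
print(round(100 * shares, 1))
```

This gives roughly 58.5% of pre-crisis loans with no category, debt consolidation at about 17.5% of all pre-crisis loans (about 42% of those with a known category), and about 62.7% of post-crisis loans.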

Debt consolidation is where someone obtains a new loan to pay off a number of smaller loans, debts, or bills that they are currently making payments on. In doing this they effectively bring all these debts together into one combined loan with one monthly payment. Since this brings multiple debts together and combines them into one loan, it is referred to as “consolidating” them. That’s why it’s called a debt consolidation loan.
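To make the "one monthly payment" concrete, here is a minimal sketch using the standard fixed-payment (annuity) formula. That Prosper computes payments this way is an assumption, not something stated in the data dictionary; `monthly_payment()` is a hypothetical helper. The inputs are taken from row 2 of the `head(df)` output above (LoanOriginalAmount 10000, BorrowerRate 0.0920, Term 36), whose listed MonthlyLoanPayment is 318.93.

```r
# Standard fixed-payment (annuity) formula; assumed here,
# not taken from Prosper's documentation
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12                          # monthly interest rate
  principal * r / (1 - (1 + r)^(-term_months))   # level payment per month
}

# Row 2 of head(df): LoanOriginalAmount 10000, BorrowerRate 0.0920, Term 36
payment <- monthly_payment(10000, 0.0920, 36)
print(round(payment, 2))
```

If the result lands within a few cents of the listed payment, that supports reading BorrowerRate as a nominal annual rate compounded monthly, at least for this loan.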

Let’s look at some of the demographics of the people who have taken out debt consolidation loans.

What is the occupation of people who take out the debt consolidation loans?

  #keep ten occupations per data set, as there are too many to visualise
  tally_occupation <- function(df_in) {
    result <- df_in %>%
      group_by(Occupation) %>%
      tally(sort = TRUE) %>%
      # skip the most frequent row and keep the next ten (rows 2 to 11)
      slice(2:11)

    return(result)
  }

  pre_df.occupation <- tally_occupation(pre_df)
  
  post_df.occupation <- tally_occupation(post_df)
  
  p1 <- ggplot(pre_df.occupation, aes(x = Occupation, y = n/sum(n)) ) + 
    geom_bar(stat="identity", na.rm = TRUE) +
    theme(axis.text.x = element_text(angle = 90)) +
    ggtitle('Top Pre Crisis Occupations')+
    ylab("Percentage") +
    xlab(NULL) +
    scale_y_continuous(labels=scales::percent, limits=c(0,0.35))
  
  p2 <- ggplot(post_df.occupation, aes(x = Occupation, y = n/sum(n)) ) + 
    geom_bar(stat="identity", na.rm = TRUE) +
    theme(axis.text.x = element_text(angle = 90)) +
    ggtitle('Top Post Crisis Occupations') +
    ylab(NULL) +
    xlab(NULL) +
    scale_y_continuous(labels=scales::percent, limits=c(0,0.35))
  
  grid.arrange(p1,p2,ncol=2)

There is very little change in the occupations of borrowers before and after the crisis. Borrowers listed as “Professional” account for the largest share of loans. One thing to note, however, is that 6 of the top 10 occupations are high-earning ones. Now let’s look further and see how many of these borrowers have taken out debt consolidation loans.

  tally_occupation <- function(df_in, col) {
    #Takes in a df with list of columns, mutates to get a column where debt consolidation is true,
    #groups by occupation and tallys. Then filters df with the list of rows to keep.
    result <- df_in %>%
    mutate(new = ifelse(ListingCategory..numeric. == 1,1,0)) %>%  
    group_by(Occupation, new) %>%
    tally(sort = T) %>%
    filter(Occupation %in% col)
    
    return(result)
  }
  
  col <- c('Administrative Assistant', 'Analyst', 'Clerical','Computer Programmer','Executive','Professional','Sales - Commission', 'Sales - Retail', 'Teacher', 'NA')
  pre_df.occupation <- tally_occupation(pre_df, col)
  
  col <- c('Accountant/CPA','Administrative Assistant', 'Analyst','Computer Programmer', 'Executive', 'Nurse(RN)', 'Professional', 'Sales - Commission', 'Skilled Labor', 'Teacher')
  post_df.occupation <- tally_occupation(post_df, col)
  
  p1 <- ggplot(data = pre_df.occupation, 
         aes(x = Occupation, y = n/sum(n), fill = new) ) + 
    geom_bar(stat="identity", na.rm = TRUE) +
    theme(axis.text.x = element_text(angle = 90)) +
    ggtitle('Top Pre Crisis Occupations')+
    ylab("Share of Loans") +
    xlab(NULL) +
    scale_y_continuous(labels=scales::percent, limits=c(0,0.35)) + 
    theme(legend.position = "none")
  
 p2 <- ggplot(post_df.occupation, 
        aes(x = Occupation, y = n/sum(n), fill = new) ) + 
    geom_bar(stat="identity", na.rm = TRUE) +
    theme(axis.text.x = element_text(angle = 90)) +
    ggtitle('Top Post Crisis Occupations')+
    ylab(NULL) +
    xlab(NULL) +
    scale_y_continuous(labels=scales::percent, limits=c(0,0.35))+
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
  
  grid.arrange(p1,p2,ncol=2)

This is where we see the change in loan trends before and after the crisis.

  • Before the crisis, a very small percentage of the loans were debt consolidation loans; for professionals, debt consolidation loans made up about 5% of all loans.
  • After the crisis, the majority of the loans are debt consolidation loans in every occupation. Taking the professional category only, there is an increase from 5% to 21%, showing how most of the loans after the crisis were debt consolidation loans. From the categories graph we see that almost 60% of the total loans post crisis were debt consolidation loans.
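Note that the bars above use n/sum(n), so each segment is a share of all plotted loans rather than a share within its own occupation. A small sketch on made-up toy data (the column names follow the data set, the rows are hypothetical) shows the difference between the two ways of counting:

```r
# Toy data: six hypothetical loans; category 1 stands for debt consolidation.
toy <- data.frame(
  Occupation = c("Professional", "Professional", "Professional",
                 "Teacher", "Teacher", "Analyst"),
  ListingCategory..numeric. = c(1, 1, 7, 1, 0, 3)
)
toy$is_debt <- toy$ListingCategory..numeric. == 1

# Share of ALL loans that are debt consolidation loans from each occupation
# (this mirrors the n/sum(n) scaling used in the plots).
share_of_all <- tapply(toy$is_debt, toy$Occupation, sum) / nrow(toy)

# Share WITHIN each occupation that are debt consolidation loans.
share_within <- tapply(toy$is_debt, toy$Occupation, mean)

round(cbind(share_of_all, share_within), 2)
```

The first column is what the stacked bars display; the second answers the question "what fraction of loans in this occupation are debt consolidation loans?".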

Which State Has More Loans?

Heat map plotting was learnt from here

  states <- map_data("state")
  #change the abbreviations to state names
  pre_df$region <- tolower(abbr2state(pre_df$BorrowerState))
  post_df$region <- tolower(abbr2state(post_df$BorrowerState))
  #check
  head(post_df$region)
## [1] "colorado"   "georgia"    "minnesota"  "new mexico" "kansas"    
## [6] "california"
  make_df_merged <- function(df_in){
     #mean loan amount and number of loans per state
    df_merge <- df_in %>%
       filter(!is.na(region)) %>%
       group_by(region) %>%
       summarise(mean = round(mean(LoanOriginalAmount),1), n = n())
    return(df_merge)
  }

  state_merged_df <- function(df_in, states) {
    #merge the two dfs. and sort. Return result. 
    result <-  merge(states, df_in, by ="region", all.x = TRUE) %>%
       arrange(order)
    return(result)
  }  

  get_names <- function(df_in, states){
    #get the text names and merge df and sort and return result. 
    snames <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
    resultnames <- merge(snames, df_in, by="region", all.x = TRUE) 
    return(resultnames)
  }
  
  df_merge <- make_df_merged(pre_df)
  pre_df_geo <- state_merged_df(df_merge,states) 
  snames <- get_names(df_merge, states)
  
 ggplot(pre_df_geo , aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = mean))+
  geom_text(data=snames, aes(long, lat, label=mean),colour='white', check_overlap = TRUE)+
   ggtitle('Pre Crisis Mean Loan Amount By State')

The highest mean amount borrowed is in one of the smallest states, i.e. New Jersey, while the rest of the states seem to have similar mean amounts borrowed.

 ggplot(pre_df_geo , aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = n))+
  geom_text(data=snames, aes(long, lat, label=n),colour='white', check_overlap = TRUE)+
  ggtitle('Pre Crisis Number of Loans By State')

By the number of loans asked for, California seems to lead. New Jersey, despite its high mean amount, sits further down the ladder with a low number of loans.

  df_merge <- make_df_merged(post_df)
  post_df_geo <- state_merged_df(df_merge,states) 
  snames <- get_names(df_merge, states)
  
  ggplot(post_df_geo , aes(x = long, y = lat)) + 
    geom_polygon(aes(group = group, fill = mean))+ 
    geom_text(data=snames, aes(long, lat, label=mean),colour='white', check_overlap = TRUE)+
    ggtitle('Post Crisis Mean Loan Amount By State')

We see that many states have higher mean loan amounts than before, and the states also differ more from each other, since there are many more lighter blues than before.

 ggplot(post_df_geo , aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = n))+
  geom_text(data=snames, aes(long, lat, label=n),colour='white', check_overlap = TRUE)+
  ggtitle('Post Crisis Number of Loans By State')

However, the number of loans by state seems to follow a similar pattern as before, with California leading the way in the number of loans after the crisis even though its mean loan amount was lower.

Income Range Vs Borrower Rate

p1 <- ggplot(data = pre_df, aes(x = StatedMonthlyIncome*12, y = BorrowerRate, color=IsBorrowerHomeowner)) +
 geom_point(stat="identity", alpha = 1/10, size = 0.75,  na.rm = TRUE) +
 scale_x_continuous(limits=c(min(pre_df$StatedMonthlyIncome*12), quantile(pre_df$StatedMonthlyIncome*12, 0.99))) +
 xlab(NULL) +
 ggtitle('Pre Crisis Income Vs Borrower Rate')

p2 <- ggplot(data = post_df, aes(x = StatedMonthlyIncome*12, y = BorrowerRate, color=IsBorrowerHomeowner)) +
 geom_point(stat="identity", alpha = 1/10, size = 0.75,  na.rm = TRUE) +
 xlab("Yearly Income of the Borrower") +
 scale_x_continuous(limits=c(min(pre_df$StatedMonthlyIncome*12), quantile(pre_df$StatedMonthlyIncome*12, 0.99))) +
 ggtitle('Post Crisis Income Vs Borrower Rate')

ggarrange(p1, p2, nrow=2, common.legend = TRUE, legend="bottom")

Common legend code from here

Income range was given as a set of brackets, so I chose the stated monthly income and multiplied it by 12 to get yearly income, which is much easier to work with. I compared yearly income with the borrower rate to see if there is any connection between the two. The points are all over the place, i.e. no connection is visible, so I added the IsBorrowerHomeowner field to check for a relationship there. Yearly income appears to be strongly related to whether the borrower is a homeowner: both before and after the crisis, borrowers with an income of less than 50,000 per annum are unlikely to own a home. Before the crisis the borrower rate went up to 0.5, but after the crisis it is capped at 0.3 no matter what the income.
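The homeowner observation can also be checked numerically instead of by eye. The following is only a sketch on made-up toy data (the column names follow the data set, while the values and the income brackets are my own assumptions): it bins yearly income and computes the share of homeowners in each bin.

```r
# Toy data: six hypothetical borrowers.
toy <- data.frame(
  StatedMonthlyIncome = c(2000, 3000, 3500, 5000, 6000, 9000),
  IsBorrowerHomeowner = c("False", "False", "True", "True", "False", "True")
)

# Yearly income, as used on the x axis of the plots above.
yearly <- toy$StatedMonthlyIncome * 12

# Assumed brackets: up to 50k, 50k to 100k, above 100k (right-closed intervals).
bracket <- cut(yearly,
               breaks = c(0, 50000, 100000, Inf),
               labels = c("0-50k", "50k-100k", "100k+"))

# Share of homeowners per bracket (mean of a logical vector is a proportion).
homeowner_share <- tapply(toy$IsBorrowerHomeowner == "True", bracket, mean)
homeowner_share
```

Applied to pre_df and post_df, the same three steps would give one proportion per bracket and period, which is easier to compare than two dense scatter plots.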

Income Range Vs Loan Amount Asked For

ggplot(data = pre_df, aes(x = StatedMonthlyIncome*12, y = LoanOriginalAmount, color=CreditGrade)) +
 geom_point(stat="identity", alpha = 1/5, size = 0.75,  na.rm = TRUE) +
 xlab("Yearly Income of the Borrower") +
 scale_x_continuous(limits=c(min(pre_df$StatedMonthlyIncome*12), quantile(pre_df$StatedMonthlyIncome*12, 0.99))) +
 scale_fill_brewer(type = 'qual') +
 ggtitle('Yearly Income Range Vs Loan Amount Before the Crisis')

ggplot(data = post_df, aes(x = StatedMonthlyIncome*12, y = LoanOriginalAmount, color=ProsperRating..Alpha.)) +
 geom_point(stat="identity", alpha = 1/5, size = 0.75,  na.rm = TRUE) +
 xlab("Yearly Income of the Borrower") +
 scale_x_continuous(limits=c(min(pre_df$StatedMonthlyIncome*12), quantile(pre_df$StatedMonthlyIncome*12, 0.99))) +
 scale_fill_brewer(type = 'qual') +
 ggtitle('Yearly Income Range Vs Loan Amount After the Crisis')

Since the Credit Grade and the Prosper Rating measure the same thing, only assigned before and after the crisis respectively, I decided to see how the ratings relate to the combination of loan amount and income. While the scatter plot is scattered, there is a marked difference between the pre and post crisis plots.

  • Loan amounts are higher for the same income range after the crisis, which can indicate that people are asking for larger loans than before.
  • After the crisis there is a marked change in Prosper Ratings / Credit Grades at the $4000, $5000, $10000 and $20000 marks on the Y axis, while before the crisis no such change is seen. In fact, almost all of the blue and purple points are below the $5000 mark, which can point to a more cautious way of rating the borrower.

Estimated Loss and Estimated Return based on Debt to Income Ratio showing ProsperRating

ggplot(data = post_df, aes(x = EstimatedLoss, y = DebtToIncomeRatio, color=ProsperRating..Alpha.)) +
  geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
 ggtitle('Post Crisis Debt to Loss Comparison of Loans')+ 
    theme(legend.position = "none")

ggplot(data = subset(post_df, EstimatedReturn > 0), aes(x = EstimatedReturn, y = DebtToIncomeRatio, color=ProsperRating..Alpha.)) +
    geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
    ggtitle('Post Crisis Debt to Return Comparison of Loans')+ 
    theme(legend.position = "none")

ggplot(data = post_df, aes(x = EstimatedEffectiveYield, y = EstimatedLoss, color=ProsperRating..Alpha.)) +
    geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
    ggtitle('Yield Vs Estimated Loss')+ 
    theme(legend.position = "none")

ggplot(data = subset(post_df, EstimatedReturn > 0), aes(x = EstimatedEffectiveYield, y = EstimatedReturn, color=ProsperRating..Alpha.)) +
    geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
    ggtitle('Yield Vs Estimated Return')+ 
    theme(legend.position = "none")

Now we can clearly see what the Prosper Rating after the crisis depended on, i.e. Estimated Loss. The higher the estimated loss, the worse the Prosper Rating. Plots 1 and 3 are important in showing this relationship, as the colour changes with the Estimated Loss. This highlights that the rating was based on the estimated loss that could be incurred from the borrower.

I also checked the relationship with the Estimated Return column, but there is clearly no connection between the rating and the return, with the coloured dots spread all over the graph.

Another highly interesting graph is the plot of EstimatedEffectiveYield vs Estimated Return. Once again there is a clear change in colour, i.e. in the Prosper Rating, showing how the rating varies with the yield. An interesting observation is the solid straight line through the graph. This is where the yield equals the return, i.e. where the loss equals 0. Along this line the colours depend only on the effective yield and not on the loss.
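The straight line can be illustrated with a small sketch. It assumes that Estimated Return is simply the Estimated Effective Yield minus the Estimated Loss, which is how I read the Prosper variable definitions; the values below are made up for illustration.

```r
# Toy data: four hypothetical loans; the last one has zero estimated loss.
toy <- data.frame(
  EstimatedEffectiveYield = c(0.08, 0.12, 0.20, 0.10),
  EstimatedLoss           = c(0.02, 0.05, 0.11, 0.00)
)

# Assumption: return = effective yield - estimated loss.
toy$ImpliedReturn <- toy$EstimatedEffectiveYield - toy$EstimatedLoss

# Points with zero loss fall on the diagonal where return equals yield.
toy$OnDiagonal <- toy$EstimatedLoss == 0

toy
```

Under this assumption every point sits below the diagonal by exactly its estimated loss, so the vertical distance from the line is the loss itself.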

How many borrowers are also home owners?

paste((count(subset(df, df$IsBorrowerHomeowner == 'True')) / count(df))*100, "% of borrowers are also homeowners")
## [1] "50.4471769486646 % of borrowers are also homeowners"

How many home owners have taken a debt consolidation loan?

paste((count(subset(df, df$IsBorrowerHomeowner == 'True' & df$ListingCategory..numeric. == 1)) / count(df))*100, "percent of all borrowers are homeowners who have taken out debt consolidation loans") 
## [1] "27.4142727998806 percent of all borrowers are homeowners who have taken out debt consolidation loans"

What percentage of the borrowers have taken out debt consolidation loans?

paste((count(subset(df, df$ListingCategory..numeric. == 1)) / count(df))*100, "percent of the borrowers have taken out debt consolidation loans") 
## [1] "51.1756497011506 percent of the borrowers have taken out debt consolidation loans"

I did not think there was a need to make a visual to answer these questions, since the numbers speak for themselves. About half of the borrowers are homeowners, and a large share of them hold debt consolidation loans. This suggests that many homeowners carry more than one debt and therefore have a reason to take out a debt consolidation loan.
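One detail worth making explicit is the denominator. The 27.4% figure above divides by all rows of df, so it is a share of all borrowers rather than a share of homeowners. A minimal sketch on made-up toy data (the column names follow the data set, the values are hypothetical) shows the two versions side by side, followed by the same conversion applied to the two percentages reported above:

```r
# Toy data: eight hypothetical borrowers; category 1 is debt consolidation.
toy <- data.frame(
  IsBorrowerHomeowner = c("True", "True", "True", "False",
                          "False", "False", "False", "True"),
  ListingCategory..numeric. = c(1, 1, 0, 1, 7, 1, 2, 3)
)

homeowner   <- toy$IsBorrowerHomeowner == "True"
debt_consol <- toy$ListingCategory..numeric. == 1

# Share of ALL borrowers who are homeowners with a debt consolidation loan.
share_of_all <- mean(homeowner & debt_consol) * 100

# Share of HOMEOWNERS who have a debt consolidation loan.
share_of_homeowners <- mean(debt_consol[homeowner]) * 100

# Same conversion using the percentages printed above for the full data set.
reported_share <- 27.4142727998806 / 50.4471769486646 * 100

c(share_of_all, share_of_homeowners, round(reported_share, 1))
```

On the reported numbers this works out to roughly 54% of homeowners holding a debt consolidation loan.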

Let’s look at the actual loans and what they are affected by:

What are the summary statistics for the loan amounts approved?

summary(pre_df$LoanOriginalAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4500    6166    8000   25000
summary(post_df$LoanOriginalAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    9077   13500   35000
# BOX PLOT FOR LOAN AMOUNTS. 
  boxplot(pre_df$LoanOriginalAmount, post_df$LoanOriginalAmount,
    names = c("Pre Crisis", "Post Crisis"),
    main = "Loan Amounts",
    xlab = "Pre Crisis and Post Crisis",
    ylab = "Amounts")

Before the crisis, the loan amounts taken were lower than after the crisis. While the minimum loan amount is the same for both periods, we see a wider range after the crisis, with a higher mean and median.

The simplest plot, i.e. the boxplot, also highlights this, with the post-crisis median and quartiles sitting higher than the pre-crisis ones.

What is the monthly income of the borrowers?

# BOX PLOT FOR STATED MONTHLY INCOME. 
  boxplot(pre_df$StatedMonthlyIncome, post_df$StatedMonthlyIncome,
    main = "What is the monthly income of the Borrowers",
    xlab = "Pre Crisis and Post Crisis",
    ylab = "Income")

  boxplot(pre_df$StatedMonthlyIncome, post_df$StatedMonthlyIncome ,
        main = "What is the monthly income of the Borrowers",
        xlab = "Pre Crisis and Post Crisis",
        ylab = "Income",
        outline=FALSE)

It seems that along with the higher loan amounts, the borrowers’ incomes have also increased after the crisis. The outliers are so high that I could not see the actual boxes; only after plotting the boxplot without the outliers do I see that incomes after the crisis are higher. This can be associated with the higher cost of living after the crisis and with inflation: $1 in 2008 is worth about $1.17 in 2018. This may also help explain the higher loan amounts.
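As a rough check on how much of the increase inflation alone could explain, the post-crisis loan amounts from the summary above can be deflated by the quoted factor. This is only a sketch: it applies the 2008 to 2018 factor to loans that were actually issued between 2009 and 2014, so it overstates the inflation adjustment.

```r
# Inflation factor quoted above: $1 in 2008 is about $1.17 in 2018.
inflation_factor <- 1.17

# Loan amounts taken from the summary() output above.
pre_median  <- 4500
pre_mean    <- 6166
post_median <- 7500
post_mean   <- 9077

# Post-crisis amounts expressed in (approximate) 2008 dollars.
post_median_2008 <- post_median / inflation_factor
post_mean_2008   <- post_mean / inflation_factor

round(c(post_median_2008 = post_median_2008,
        post_mean_2008   = post_mean_2008))
```

Even after this generous adjustment, the post-crisis median and mean stay above the pre-crisis values of 4500 and 6166, so inflation on its own does not seem to account for the larger loans.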

On Time vs Less than one month late vs more than one month late payments

p1 <-  ggplot(data = pre_df) +
    geom_point( aes(x = MonthlyLoanPayment, y = OnTimeProsperPayments), stat="identity", alpha = 1/5, size = 0.75, color='blue') +
    geom_point( aes(x = MonthlyLoanPayment, y = ProsperPaymentsLessThanOneMonthLate), stat="identity", alpha = 1/5, size = 0.75, color = 'red') +
    geom_point( aes(x = MonthlyLoanPayment, y = ProsperPaymentsOneMonthPlusLate), stat="identity", alpha = 1/5, size = 0.75, color = 'green') +
    ylab("Count") +
    xlab('Loan Payment Amount') +
    ggtitle('Pre Crisis Comparison of Loan Payments')

p2 <- ggplot(data = post_df) +
    geom_point( aes(x = MonthlyLoanPayment, y = OnTimeProsperPayments), stat="identity", alpha = 1/5, size = 0.75, color='blue') +
    geom_point( aes(x = MonthlyLoanPayment, y = ProsperPaymentsLessThanOneMonthLate), stat="identity", alpha = 1/5, size = 0.75, color = 'red') +
    geom_point( aes(x = MonthlyLoanPayment, y = ProsperPaymentsOneMonthPlusLate), stat="identity", alpha = 1/5, size = 0.75, color = 'green') +
    ylab("Count") +
    xlab('Loan Payment Amount') +
    ggtitle('Post Crisis Comparison of Loan Payments')

grid.arrange(p1, p2, ncol=2)

Apart from the smaller number of loans taken before the crisis, we see that before the crisis only a very small percentage of loan payments were less than a month late and almost none were more than one month late. After the crisis, though, a significant percentage of loan payments were late, many of them by more than one month.

Time Vs Loans

  df.time <- df %>%
  group_by(Date) %>%
  summarise(n = n(), mean = mean(LoanOriginalAmount), median = median(LoanOriginalAmount), sum = sum(LoanOriginalAmount))

 ggplot(data = df.time) +
    geom_line( aes(x = Date, y = median, group = 1), color = 'blue')+
    geom_line( aes(x = Date, y = mean, group = 1),  color='red') +
    theme(axis.text.x = element_text(angle = 90)) +
    ylab("Loan Amount") +
    ggtitle('Mean and Median Loan Amounts')

For this duration, the median is lower than the mean. An interesting thing to note is the massive drop from the last quarter of 2008 to mid 2009. These are the years of the Great Financial Crisis, and this graph suggests that loan trends were affected significantly by the crisis of 2008.

Final Plots and Summary

The aim of the final plots is to combine 2-4 plots into one to form a summary of the trends I have tried to highlight during the analysis.

Comparison Of Debt Consolidation Loans Before and After The Crisis.

Knowing that debt consolidation loans are taken out by about 51.17% of all borrowers, it is important to know the demographics of these borrowers. Their occupations are what I have plotted. It is also important to segment the data into before and after the crisis to highlight the drastic change in borrowing. Where debt consolidation was a minor part of the total loan categories before the crisis, it became the number 1 reason for taking loans after the crisis.

#suppress warnings 
oldw <- getOption("warn")
options(warn=-1)
library(grid)

pre_df.debt <- subset(pre_df, ListingCategory..numeric. == 1)
post_df.debt <- subset(post_df, ListingCategory..numeric. == 1)

p1 <- ggplot(data = pre_df.occupation, 
       aes(x = Occupation, y = n/sum(n), fill = new) ) + 
  geom_bar(stat="identity", na.rm = TRUE) +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle('Top Pre Crisis Occupations')+
  ylab("Share of Loans") +
  xlab(NULL) +
  scale_y_continuous(labels=scales::percent, limits=c(0,0.35)) + 
  theme(legend.position = "none")

p2 <- ggplot(post_df.occupation, 
      aes(x = Occupation, y = n/sum(n), fill = new) ) + 
  geom_bar(stat="identity", na.rm = TRUE) +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle('Top Post Crisis Occupations')+
  ylab(NULL) +
  xlab(NULL) +
  scale_y_continuous(labels=scales::percent, limits=c(0,0.35))+
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
lay <- rbind(c(1,1,1,2,2,2),
             c(1,1,1,2,2,2))

grid.arrange(p1,p2, layout_matrix = lay)

In the graph above, the light blue segments show the share of debt consolidation loans while the dark blue segments show loans taken for other reasons, and it becomes clear why debt consolidation makes up about 51% of the data. Prior to the crisis, there was apparently little reason to take debt consolidation loans, as people’s income levels were on par with their debt. As mentioned before, a debt consolidation loan combines all other loans into one big loan with low interest, so that an individual only has to make one monthly payment. After the crisis, borrowing has completely switched, with the majority being debt consolidation loans and only a minority being loans for other purposes. It should be mentioned, as I know from the earlier category plot, that after the crisis people also started taking loans for luxury items such as boats, motorcycles and vacations. In light of this, the proportion of debt consolidation loans relative to other loans is still astonishing.

Coming to the occupations of the people asking for loans, it can be seen that loans mostly go to people with a steady and high income, such as analysts, computer programmers or executives. These people have some of the highest-earning jobs and can thus prove their income and their ability to pay back their loans. However, it is the professional category that has the largest number of loans both pre crisis and post crisis.

Factors Affecting Prosper Rating (Applicable after July 2009)

Given the numerous factors behind the Prosper Rating, I wanted to know which is the most important, i.e. what the Prosper Rating depends on. The majority of the data is from after the crisis, i.e. after July 2009, so I had enough data to draw a conclusion.

p1 <- ggplot(data = post_df, aes(x = DebtToIncomeRatio, y = EstimatedLoss, color=ProsperRating..Alpha.)) +
      geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
      theme(legend.position = "none")
p2 <- ggplot(data = subset(post_df, EstimatedReturn > 0), aes(x = DebtToIncomeRatio, y = EstimatedReturn, color=ProsperRating..Alpha.)) +
      geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
      theme(legend.position = "none")
p3 <- ggplot(data = post_df, aes(x = EstimatedEffectiveYield, y = EstimatedLoss, color=ProsperRating..Alpha.)) +
      geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
      theme(legend.position = "none")
p4 <- ggplot(data = subset(post_df, EstimatedReturn > 0), aes(x = EstimatedEffectiveYield, y = EstimatedReturn, color=ProsperRating..Alpha.)) +
      geom_point(stat="identity", alpha = 1/10 , size = 0.75,  na.rm = TRUE) +
      theme(legend.position = "none")

figure <- ggarrange(p1, p2, p3, p4, ncol=2, nrow=2, common.legend = TRUE, legend="bottom")
annotate_figure(figure,
                top = text_grob("Factors Affecting Prosper Rating.", color = "Black", face = "bold"))

From the given plot, which combines four plots of different variable pairs, I can easily see that the Prosper Rating depends on Estimated Loss. The greater the estimated loss, the worse the rating. Estimated loss is in turn presumably calculated from a variety of elements such as demographics (occupation, employment status, investors) and income (monthly income, other loans, number of delinquencies and amount delinquent).

I can also say that the Prosper Rating depends on estimated loss because of the clean lines formed in the plots against Estimated Loss, while no such lines are seen in the plots against Estimated Return. This is also highlighted by the 4th graph, Estimated Return vs Estimated Effective Yield. It is obvious that the two variables are correlated: an increase in Estimated Return should also give an increase in Estimated Effective Yield. However, the two are not linearly related; the relationship looks quadratic. This is because of the Estimated Loss variable, which highlights how vital Estimated Loss is in determining the Prosper Rating.
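One way to test the "rating follows estimated loss" reading without a plot is to check whether the ranges of EstimatedLoss for the different ratings overlap. The sketch below uses made-up toy values (the column names follow the data set, while the ratings and losses are hypothetical); if the bands do not overlap, the rating can be read off the estimated loss alone.

```r
# Toy data: two hypothetical loans per rating.
toy <- data.frame(
  ProsperRating..Alpha. = c("AA", "AA", "A", "A", "B", "B", "HR", "HR"),
  EstimatedLoss = c(0.010, 0.019, 0.025, 0.039, 0.045, 0.059, 0.150, 0.190)
)

# Lowest and highest estimated loss observed for each rating.
lo <- tapply(toy$EstimatedLoss, toy$ProsperRating..Alpha., min)
hi <- tapply(toy$EstimatedLoss, toy$ProsperRating..Alpha., max)

# Order the ratings by their lowest loss.
ord <- order(lo)
bands <- data.frame(rating = names(lo)[ord],
                    lo = as.numeric(lo[ord]),
                    hi = as.numeric(hi[ord]))

# Bands are disjoint if each band ends before the next one starts.
no_overlap <- all(bands$hi[-nrow(bands)] < bands$lo[-1])

bands
no_overlap
```

Running the same steps on post_df (after dropping rows with missing ratings) would show whether the real bands are as cleanly separated as the colours in the plot suggest.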

Comparisons of State Incomes with State Loans

I wanted to check how state incomes are mapped with state loans and thus, I plotted the mean of the two giving me an indication of how the states handle debt.

states <- map_data("state")
#change the abbreviations to state names
df$region <- tolower(abbr2state(df$BorrowerState))
  
df_merge <- df %>%
       filter(!is.na(region)) %>%
       group_by(region) %>%
       summarise(meanLoan = round(mean(MonthlyLoanPayment),2) , meanIncome = round(mean(StatedMonthlyIncome),2))

df_geo <- merge(states, df_merge, by ="region", all.x = TRUE) %>%
       arrange(order)
snames <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
snames <- merge(snames, df_merge, by="region", all.x = TRUE) 

ggplot(df_geo , aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = meanIncome))+
  geom_text(data=snames, aes(long, lat, label = meanIncome),colour='white', check_overlap = TRUE)+
  ggtitle('Mean Monthly Income By State')

ggplot(df_geo , aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = meanLoan))+
  geom_text(data=snames, aes(long, lat, label = meanLoan),colour='white', check_overlap = TRUE)+
  ggtitle('Mean Monthly Loan Payment By State')

This chart shows the recent debt per state in the US, where California, Texas and New York have massive amounts of debt.

The data given only has Prosper loans up to 2014, so it is important to also look at the current situation. As mentioned before, incomes and loan amounts have increased after the crisis. In the chart, we see that the states with the highest average incomes are paying the least towards debt, while states such as California, which have a low average income, are paying a high amount for their debt. This can show the start of the debt crisis that the world is now facing, with more defaulters than individuals paying on time.
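The comparison between the two maps can be condensed into a single number per state: the mean monthly loan payment as a share of mean monthly income. The sketch below uses the meanLoan and meanIncome column names from df_merge, but the three rows and their values are made up for illustration; paymentShare is a hypothetical helper column.

```r
# Toy stand-in for df_merge: values are invented, not taken from the data.
toy <- data.frame(
  region     = c("california", "texas", "new york"),
  meanLoan   = c(300, 250, 270),    # mean monthly loan payment
  meanIncome = c(5000, 5000, 6000)  # mean stated monthly income
)

# Share of monthly income that goes towards the loan payment.
toy$paymentShare <- toy$meanLoan / toy$meanIncome

# States with the heaviest payment burden first.
toy <- toy[order(-toy$paymentShare), ]
toy
```

Mapping paymentShare with the same geom_polygon code would put both maps on one scale and make the "high income, low payment" pattern directly visible.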

Conclusion and Reflection

I was interested in analyzing this particular set of data because of my interest in world affairs and the debt crisis. The debt crisis was particularly interesting because its causes are rooted in the 1980s. Bad government policies and irresponsible borrowing, along with a lack of awareness, seem to be the key reasons for this debt crisis. In 2008, U.S. households lost an estimated 18% of their net worth, a loss of approximately $11.2 trillion. In 2011, the U.S. debt reached 100% of its GDP, and now the federal government’s total debt stands at an enormous $21.97 trillion. It’s all connected.

Thus, I undertook the effort to try and figure out the consequences of the Financial Crisis of 2008-09. I found it tougher than expected, because I was unfamiliar with the syntax, the function calls, and the overall behavior of R. Maybe it would have gone better in Jupyter Notebooks, as I am familiar with them.

I was pleased with how easy it is to produce a simple map in R using the state data from the maps library. Plotting scatter plots and lines was easy enough and straightforward. But I spent a lot of time looking for advanced plotting packages that would make my life easier. After wasting a lot of time, I felt compelled to code it myself and get the job done correctly. These are the problems of learning a new language so quickly.

I got hung up multiple times when my code started behaving oddly. For example, when I was plotting the map, the figure was not drawing and only the background was showing. I had to ask my fellow Bertelsmann Scholars for help, and some of them were just as stuck on it as I was. On the other hand, it was nice to be able to run code in a function and access the results from the console.

At the end of it all, R programming is not a whole lot easier than Java or C (which I have done a lot of coding in), except for really simple data structures. If asked for a preference, I’d choose Python any day. However, given as much time with R as I have had with Python, it would probably be hard to decide.

I feel I would have been able to do this project better in Python, as I am more familiar with the syntax, having used it as my main language for the past couple of years. With the looming deadline, there was only a finite amount of time I could spend on this project and on trying to learn R.